Data science tricks I learned while working in public health surveillance during the COVID-19 pandemic
Lessons from COVID-19 Relief Work
In July 2021, I took on a contract through the CDC Foundation to work at the Oklahoma State Department of Health (OSDH) as a relief worker. I was attached to the surveillance team for acute diseases to help with their collection and analysis of COVID-19 data. When I arrived, I found a number of challenges that I know have been issues at other departments around the country, so I felt there was some merit in sharing my experiences and potential solutions.
In this post I am going to focus on some of the data science issues that I ran into; other challenges (silo-ing, departmental issues, communication, etc.) I will probably address in a separate post. If you are wondering how your data science skills might be needed or challenged in the field, read on!
Doing More with Less
When it comes to the challenges in data science that I found at the start of my contract, most of them pointed to a real need to process more data, provide more analyses, and manage a database – all with the low-budget, low-manpower reality of a state health department. Daily tasks could take an employee up to 18 hours to complete, just so we could know how many cases were reported that day. Luckily, I was able to identify and improve how surveillance worked at OSDH by focusing on a few things, which I’ll say a bit more about in the following sections:
Streamlining the SAS code that was already in use
Implementing a SAS macro autocall library
Cleaning the data using SAS code - not by eye in an Excel sheet
Improving data accessibility
Compiling one master code to run the entire process
Setting the master code to run on its own
The most important thing I did was probably to take some time to think about where the surveillance program was, where it could be, and to make a plan to get it from A to B. All of it was carefully planned, and each piece built on a prior piece; time spent on one task bought me time for the subsequent tasks. For example, once I streamlined the SAS code, it was easier to improve the data-cleaning code, and once I no longer had to clean the data by hand, I had time to work on compiling a master code, and so on.
Streamlining SAS Code
There were some things that were already being done in SAS, so I chose it as the lingua franca instead of, say, R. Essentially, I took what was already being done and gave it another look with the intent of building a mental flow chart. Here, I noticed something: I had taken a number of computer programming classes, but that was not the norm amongst my public health colleagues. They may have had an introductory course on SAS or SPSS in graduate school, but were spared the Java, C++, C, and sadly… VBA… courses that I had to take. At least two of my previous computer science instructors had emphasized using a flow chart to guide a program, Boolean or otherwise. In fact, I was once able to use Raptor, a program in which you draw flowcharts and it generates Java code from them… I did love that tool. In any case, I think it is a shame that epidemiologists aren’t introduced to more efficient ways of coding.
Long story short: when you’re making a program, make a plan. Understand what you need the program to do, and in what steps it will do it. Draw a flowchart, evaluate the system you are creating, and simply follow that plan.
While it is difficult for me to find a quality, concise intro to flowcharts, this seems okay.
SAS Macros
The SAS macro language is separate from the SAS DATA step language, and because it is separate it is often thought of as “advanced,” difficult, or downright intimidating. However, there are some things you simply cannot do in SAS without it, making it absolutely necessary to buckle down and get involved with macros. Once you do, you’ll feel like a wizard.
Macros are valuable not only for creating ‘functions’ in your SAS programs that you can call repeatedly; they also allow you to share those macros across programs, projects, and people through macro (autocall) libraries. That includes macros you find elsewhere and copy into the autocall directory.
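To make that concrete, here is a minimal sketch of how an autocall library can be wired up. The folder path and the %clean_names macro are placeholders for illustration, not OSDH’s actual setup.

```sas
/* Sketch: setting up a macro autocall library (paths are placeholders).  */
/* Each .sas file in the folder contains one macro with a matching name,  */
/* e.g. clean_names.sas contains %macro clean_names; ... %mend;           */

filename mymacs 'S:\Surveillance\macros';         /* shared macro folder  */

options mautosource sasautos=(mymacs, sasautos);  /* search that folder,
                                                     then the SAS default */

/* Any program (or person) pointing at the same folder can now call the   */
/* macro without defining it first:                                       */
%clean_names(indata=raw_cases, outdata=clean_cases)
```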
Dynamic code is also possible with macros, and it is probably necessary for good data management. Dynamic SAS programs can make decisions about how they run: say you don’t want a certain file to be overwritten, or you only want the code to run under specific conditions – this is how you accomplish that. Macros can allow your program to find files, check files, move files, run or not run depending on whether a file exists, and more.
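Here is a minimal sketch of that kind of check, assuming a daily CSV extract; the paths, dataset names, and the macro itself are made up for illustration.

```sas
/* Sketch: only run the import when today's extract exists, and do not     */
/* overwrite a dataset that is already there. Paths/names are placeholders.*/
%macro run_if_ready(extract=, outds=);
   %if not %sysfunc(fileexist(&extract)) %then %do;
      %put WARNING: &extract not found - skipping this run.;
      %return;
   %end;
   %if %sysfunc(exist(&outds)) %then %do;
      %put NOTE: &outds already exists - leaving it alone.;
      %return;
   %end;
   proc import datafile="&extract" out=&outds dbms=csv replace;
   run;
%mend run_if_ready;

%run_if_ready(extract=S:\Surveillance\extracts\cases_today.csv,
              outds=work.cases_today)
```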
When I started my contract, I hadn’t had a lot of experience with SAS macros and had to figure much of it out as I went. A lot of my information came from Carpenter’s Complete Guide to the SAS Macro Language, and it’s a great place to get a handle on macros so that you can actually apply them.
Advanced Data Cleaning
One major challenge was to find better ways to clean our data. While it might seem like handing this entirely over to code could introduce errors, once we had a thoroughly tested SAS program it was not only much easier than cleaning by eye in Excel, but also more accurate and better at catching data entry errors. It’s simply faster and more effective when done correctly.
Now, by “correctly” I mean there were obviously a number of things we had to take into account to classify a test result as a case, such as the time between tests, repeat entries of case IDs, etc. However, the real trick was finding a way to account for misspellings of patient names. I discovered that there is a nearly 100-year-old method for determining whether differently spelled names sound alike: the SOUNDEX function (paired with either the COMPGED or SPEDIS scoring function).
In perhaps more common terms, this is called “Fuzzy Matching,” and with that figured out, everything else became much easier to clean.
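To show what that looked like in practice, here is a minimal sketch of flagging possible duplicate patients; the dataset, the variables, and the COMPGED cutoff of 200 are illustrative assumptions, not our production values.

```sas
/* Sketch: flag possible duplicate patients whose names are spelled        */
/* differently. Dataset and variable names are made up for illustration.   */
proc sql;
   create table possible_dupes as
   select a.case_id   as id_a,
          b.case_id   as id_b,
          a.last_name as name_a,
          b.last_name as name_b,
          compged(a.last_name, b.last_name) as edit_dist
   from   cases as a, cases as b
   where  a.case_id < b.case_id                        /* each pair once     */
     and  a.dob = b.dob                                /* block on birthdate */
     and  soundex(a.last_name) = soundex(b.last_name)  /* sounds alike       */
     and  calculated edit_dist <= 200;                 /* close enough       */
quit;
```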
You can also do a lot with the LAG function if you sort your data by specific variables first, and it is often easier to use (and explain) than creating an array in SAS.
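For instance, here is a sketch of using LAG to find repeat positive tests within a window; the 90-day window and all of the names are made up for illustration.

```sas
/* Sketch: after sorting by patient and test date, use LAG to find repeat  */
/* tests within 90 days of the prior test. Names are made up.              */
proc sort data=tests;
   by patient_id test_date;
run;

data tests_flagged;
   set tests;
   by patient_id;
   prev_date = lag(test_date);
   if first.patient_id then prev_date = .;   /* don't compare across people */
   days_since_last = test_date - prev_date;
   repeat_within_90 = (0 <= days_since_last < 90);
   format prev_date date9.;
run;
```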
For more on fuzzy matching, click here.
For more on the LAG function, click here.
Improving Data Accessibility
While it might seem fine to start your day by downloading your data from the server with SQL Manager or Access, it is slow, and it will prevent you from setting this task up to run on its own later.
In SAS, there is a solution for you: PROC SQL. You can basically copy and paste SQL code that you may already have into PROC SQL (once the right security settings are in place). With that done, you can move to having one SAS program that does it all for you.
It’s also true that you can simply write SQL inside PROC SQL for everyday work; sometimes it is easier to create certain tables and datasets in SQL, and if you already know SQL, then there you go!
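Here is a minimal sketch of what that download step can look like using PROC SQL pass-through; the DSN, table, and column names are placeholders, and the query inside the parentheses is written in whatever dialect your database speaks.

```sas
/* Sketch: pulling the day's records straight from the database with        */
/* PROC SQL pass-through. DSN, table, and column names are placeholders.    */
proc sql;
   connect to odbc (dsn="SurveillanceDB");     /* credentials handled by the DSN */
   create table work.todays_cases as
   select * from connection to odbc
      ( select case_id, last_name, dob, test_date, result
        from   elr.lab_results
        where  report_date = current_date );   /* inner SQL is whatever your
                                                  server understands           */
   disconnect from odbc;
quit;
```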
The one thing I have to mention is that security is obviously an issue, since the goal is to access data from a server. Because we moved to the Google Cloud Platform (GCP) to house our data for COVID-19, we needed a specific driver called “Simba” to interface with BigQuery. If your department is also using GCP, this might be a viable way to securely access data from your servers.
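I won’t reproduce our exact configuration, but once your IT/security team has set up a DSN for the Simba BigQuery ODBC driver, the SAS side can look roughly like the sketch below; the DSN, schema, and table names are placeholders.

```sas
/* Sketch only: once a DSN for the Simba BigQuery ODBC driver exists        */
/* (set up by IT/security), SAS can treat the BigQuery dataset like a       */
/* library. The DSN and schema names here are placeholders.                 */
libname gbq odbc dsn="BigQuery_Simba" schema="covid_surveillance";

proc sql;
   create table work.case_counts as
   select county, count(*) as n_cases
   from   gbq.daily_cases
   group  by county;
quit;
```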
For more on PROC SQL, click here.
For more on Simba, click here.
Having a Master Code and Automating
If you have figured out how to do the previously listed tasks, then you should be able to have one SAS program that, either through many DATA steps or a series of macros, downloads the data, cleans the data, does whatever else you need the data to do, and then places the resulting reports and datasets wherever you need them. Not only is it nice to have a single program that does this – push button, get report – you don’t even have to push the button if you get to this point.
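The shape of such a master program can be as simple as a series of %INCLUDE statements pointing at the pieces described above; the folder structure and file names here are placeholders.

```sas
/* Sketch of a "master" program: each %include is one of the pieces         */
/* described above. File paths and names are placeholders.                  */
options mautosource sasautos=('S:\Surveillance\macros', sasautos);

%include 'S:\Surveillance\code\01_pull_data.sas';    /* PROC SQL download   */
%include 'S:\Surveillance\code\02_clean_data.sas';   /* fuzzy matching etc. */
%include 'S:\Surveillance\code\03_case_counts.sas';  /* analysis datasets   */
%include 'S:\Surveillance\code\04_reports.sas';      /* output to the share */
```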
Let me introduce you to my favorite trick - Windows Task Scheduler. It is part of the Windows operating system, so there is a high probability that you already have the application. In fact, if you have SAS Enterprise Guide, there is a super easy, point-and-click way to set tasks to run whenever you want them to. You can use it with literally any programming language and a command prompt, but I’ll continue focusing on SAS.
Bottom line: if I want a program to run every day while I sleep, I can pull up Task Scheduler (or in SAS EG, use the point-and-click “Schedule” option for the project) and set it to run every day at 05:00. Done.
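If you go the command-line route instead of the Enterprise Guide button, scheduling a batch run of the master program looks something like the line below; the SAS install path, file paths, and task name are examples and will differ on your machine.

```bat
:: Sketch: run the master program in batch mode every day at 05:00.
:: The SAS install path and program paths are examples - yours will differ.
schtasks /create /tn "DailyCovidSurveillance" /sc DAILY /st 05:00 ^
  /tr "\"C:\Program Files\SASHome\SASFoundation\9.4\sas.exe\" -sysin S:\Surveillance\code\master.sas -log S:\Surveillance\logs\master.log -batch -noterminal"
```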
There are a couple of things I should also say, though. First, this may present a security challenge because of the permissions required to run administrative tasks on a department/business PC (I had to work with security for about 4 months to find a solution). Second, the code will run… and will be set free to do whatever it is you programmed it to do. So, plan on having a lengthy talk with your cyber security team about permissions, and test everything multiple times before allowing it to run unsupervised, lest you accidentally overwrite or corrupt someone’s data. We never had a problem with data being lost or overwritten, but I always tested everything for days leading up to implementation, and we always kept a backup process running until we were very comfortable with a new system.
For more on using Task Scheduler and SAS, click here.
For more on automation in general, I actually found my start in Automate the Boring Stuff with Python, and it’s where I first heard about using Task Scheduler. Click here to check out the chapter on schedulers for Windows, OS X, and (my favorite) Linux.
Let’s Wrap it Up
I hope this provided some ideas that can help you manage your data better as a public health scientist. It might seem like a lot once you look at the resources I’ve provided, but I learned most of these things as I went. I looked for the places where I could have the most impact, and I figured things out whether I was comfortable at first or not. If you know that you or your department needs to improve its data management, then I hope you’ll believe me when I say that if I could figure this out (over a period of several months), you can too. Just take it one step at a time.
If you found this helpful, had questions, or want to see applications in other languages like R, please let me know!